AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.
Data Dictionary:
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Number of years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIP Code: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
# Installing the libraries with the specified versions
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above cell, you might see a warning regarding package dependencies. This warning can be safely ignored, as the code above ensures that all necessary libraries and their dependencies are installed for the code in this notebook to run successfully.
# to load and manipulate data
import pandas as pd
import numpy as np
# to visualize data
import matplotlib.pyplot as plt
import seaborn as sns
# to split data into training and test sets
from sklearn.model_selection import train_test_split
# to build decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# to tune different models
#from sklearn.model_selection import GridSearchCV
# to compute classification metrics
from sklearn.metrics import (
confusion_matrix,
accuracy_score,
recall_score,
precision_score,
f1_score,
)
# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
# mount Google Drive (needed only when running in Google Colab; comment out otherwise)
from google.colab import drive
drive.mount('/content/drive')
# Load the data
loan_data = pd.read_csv('/content/drive/MyDrive/Python Course/Loan_Modelling.csv')
# Make a copy of the data; keep the original as a backup
data = loan_data.copy()
# View the first 5 rows of the data
data.head()
# View the last 5 rows of the data
data.tail()
# Shape of the data
data.shape
# View the columns and datatypes of the data
data.info()
# Check the statistical summary of the data
# data.describe(include="all")
data.describe().T
At least half of the customers use internet banking facilities
NOTE: The Experience column has an anomaly: its minimum value is -3, a negative number of years. This needs to be investigated further.
# Check for null values
data.isnull().sum()
# Checking for duplicate values
data.duplicated().sum()
# Check to see if the ID column is unique
data.ID.nunique()
# Drop the ID column. All values are unique.
data.drop(columns=["ID"], inplace=True)
# Check to confirm that the ID column has been dropped
data.head()
# Do a check for anomalies in each column data
#data["Experience"].unique()
#data["Online"].unique()
#data["CreditCard"].unique()
#data["Family"].unique()
#data["Education"].unique()
#data["CCAvg"].unique()
#data["Mortgage"].unique()
#data["Income"].unique()
#data["Age"].unique()
#data["CD_Account"].unique()
#data["Securities_Account"].unique()
#data["Personal_Loan"].unique()
#data["ZIPCode"].unique()
# NOTE: Only Experience has anomalies, as stated above, with negative values
data["Experience"].unique()  # Found -1, -2, -3
# View the negative values in the data
data[data["Experience"] < 0]["Experience"].unique()
# Replace the negative values with their positive counterparts
data["Experience"].replace(-1, 1, inplace=True)
data["Experience"].replace(-2, 2, inplace=True)
data["Experience"].replace(-3, 3, inplace=True)
# Check to confirm that the -ve values have been removed
data["Experience"].unique()
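As an alternative to the three individual replacements above, note that each negative value's magnitude equals its intended positive value, so `abs()` handles every case in one step. A minimal sketch on a toy Series (not the project data):

```python
import pandas as pd

# Toy Series standing in for the Experience column
experience = pd.Series([5, -1, 20, -2, 0, -3])

# Take the absolute value so -1/-2/-3 become 1/2/3 in a single step
experience = experience.abs()

print(experience.tolist())  # [5, 1, 20, 2, 0, 3]
```

This also scales to any other negative values that might appear in future data, without listing each one.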
Questions:
- Answer: The Mortgage data is heavily right skewed and there are quite a lot of outliers on the higher side of the data.
# Generate Histogram for Mortgage
sns.histplot(data = data, x = 'Mortgage')
plt.title('Mortgage')
plt.xlabel('Mortgage')
plt.show()
# Generate Boxplot for Mortgage
sns.boxplot(data = data, x= 'Mortgage')
plt.title('Boxplot of Mortgage')
plt.xlabel('Mortgage')
plt.show()
# How many customers have credit cards?
data.CreditCard.value_counts() # 0: 3530; 1: 1470
Answers to the following questions appear later in the notebook.
# Check the number of unique zip codes
data["ZIPCode"].nunique()
# There are quite a few unique zip codes; feature engineering can reduce that number
# One option is to keep only the first few characters of each zip code; here, the first two
data["ZIPCode"] = data["ZIPCode"].astype(str).str[:2]
data["ZIPCode"].nunique()
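To see the prefix idea in isolation, here is the same two-character truncation applied to a few made-up ZIP codes (illustrative values, not from the dataset):

```python
import pandas as pd

# Hypothetical ZIP codes for illustration
zips = pd.Series([94005, 94301, 90245, 92037])

# Keep only the first two characters of each ZIP code
prefixes = zips.astype(str).str[:2]

print(prefixes.nunique())  # 3 distinct prefixes instead of 4 distinct codes
```

Because US ZIP codes are geographically hierarchical, the leading digits group nearby areas together, which keeps some location signal while shrinking the category count.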
# Some columns hold numeric codes but should be treated as categorical fields
# So, convert the data type of these features to 'category'
category_cols = [
"Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
"ZIPCode",
]
data[category_cols] = data[category_cols].astype("category")
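Beyond the semantics, converting low-cardinality columns to `category` typically shrinks memory use as well, since pandas stores small integer codes instead of repeated strings. A small sketch on toy data:

```python
import pandas as pd

# A repetitive object column, similar in spirit to Online or CreditCard
s = pd.Series(["Yes", "No"] * 1000)
as_cat = s.astype("category")

# The categorical version stores codes plus two category labels
print(s.memory_usage(deep=True), ">", as_cat.memory_usage(deep=True))
```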
# Function to plot both histogram and boxplot for numeric columns
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
For Boxplot:
A triangle marker indicates the mean value of the data
For Histogram:
Vertical green dashed line (--) represents the mean/average of the data
Vertical solid black line (-) represents the median of the data
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
if bins:
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    )
else:
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # for histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
num_col = data.select_dtypes(include=np.number).columns.tolist()
num_col
# Create Histograms and Boxplots for numeric fields
num_col = data.select_dtypes(include=np.number).columns.tolist()
for item in num_col:
histogram_boxplot(data, item)
Observations
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# create labeled barplots for categorical fields
for item in category_cols:
if item != "Personal_Loan":
labeled_barplot(data, item, perc=True)
# This chart gives a better understanding of the Family column
labeled_barplot(data, "Family", perc=True)
# Chart the target field - Personal Loan
labeled_barplot(data, "Personal_Loan", perc=True)
Observations
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1), title="Personal Loan")
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
# Analyze the correlation between the numeric variables
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
Observations
# Scatter plot matrix
plt.figure(figsize=(12, 8))
sns.pairplot(data, vars=num_col, hue='Personal_Loan', diag_kind='kde');
Observation
Check how a Customer's interest in purchasing a loan varies with the categorical variables
for item in category_cols:
if item != "Personal_Loan":
stacked_barplot(data, item, "Personal_Loan")
Observations
for item in num_col:
if item != "Personal_Loan":
distribution_plot_wrt_target(data, item, "Personal_Loan")
Observations
The Age distribution seems evenly distributed for customers who either have personal loans or not. The median of the age of the customers for both groups is ~45 years.
The distribution for Experience seems evenly distributed for customers who either have personal loans or not. The median of the experience of the customer for both groups is ~20 years of experience.
The data distribution for Income is heavily right-skewed for customers that do not have personal loans, while the distribution is slightly left-skewed for customers that have personal loans.
Amongst the customers that do not have personal loans, the median Income is ~$60K and there are a number of outliers. Amongst the customers that have personal loans, the median is ~$140K.
Amongst those that do not have personal loans, more families have 1 or 2 people, while amongst those that have personal loans, more families have 3 or 4 people. The median of families that do not have personal loans is 2; The median of families that have personal loans is 3.
The data distribution for CCAvg (average monthly credit card spending) for those that do not have personal loans is heavily right-skewed and has a number of outliers. The median for this group is ~$1.5K.
The data distribution for CCAvg for those that have personal loans is right-skewed and also has a number of outliers. The median for this group is ~$4K.
The data distribution for Mortgages is heavily right-skewed and there are a lot of outliers for both those that have personal loans and those that do not have personal loans.
Questions:
3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
4. How does a customer's interest in purchasing a loan vary with their age?
5. How does a customer's interest in purchasing a loan vary with their education?
# Compute the quartiles and bounds to see how many outliers are in the data
Q1 = data.select_dtypes(include=["float64", "int64"]).quantile(0.25) # To find the 25th percentile and 75th percentile.
Q3 = data.select_dtypes(include=["float64", "int64"]).quantile(0.75)
IQR = Q3 - Q1 # Interquartile Range (75th percentile - 25th percentile)
lower = (
Q1 - 1.5 * IQR
) # Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper = Q3 + 1.5 * IQR
# Determine the percentage of outliers in each numeric column
(
(data.select_dtypes(include=["float64", "int64"]) < lower)
| (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
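The cell above only quantifies the outliers. If treatment were desired (it is not applied in this project), one common option is capping values at the IQR bounds with `clip`; a sketch on toy numbers:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an obvious outlier

# Same IQR bounds as computed for the project data above
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(capped.max())  # 8.5 (the upper bound, replacing 100)
```

Capping is often avoided for tree-based models, which split on thresholds and are largely insensitive to extreme values, which is one reason the outliers are left untreated here.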
# Get a copy of the data
data_copy = data.copy()
# Prepare for the model
# Define the explanatory (independent) and response (dependent) variables
X = data.drop(["Personal_Loan"], axis=1)
Y = data["Personal_Loan"]
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
X = X.astype(float)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape, '\n')
print("Percentage of classes in training set:")
print(100*y_train.value_counts(normalize=True), '\n')
print("Percentage of classes in test set:")
print(100*y_test.value_counts(normalize=True))
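Since only about 10% of customers accepted the loan, passing `stratify=Y` to `train_test_split` would guarantee identical class proportions in train and test sets rather than relying on random sampling. A sketch with a synthetic imbalanced target (not the project data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: ~10% positives, similar to Personal_Loan
X_demo = np.random.default_rng(1).normal(size=(1000, 3))
y_demo = np.array([0] * 900 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)
# Both splits keep exactly 10% positives
print(y_tr.mean(), y_te.mean())
```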
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
# list of feature names in X_train
feature_names = list(X_train.columns)
print(feature_names)
# list of feature names in X_train
feature_names = list(X_train.columns)
# set the figure size for the plot
plt.figure(figsize=(20, 30))
# plotting the decision tree
out = tree.plot_tree(
model, # decision tree classifier model
feature_names=feature_names, # list of feature names (columns) in the dataset
filled=True, # fill the nodes with colors based on class
fontsize=9, # font size for the node text
node_ids=False, # do not show the ID of each node
class_names=None, # whether or not to display class names
)
# add arrows to the decision tree splits if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black") # set arrow color to black
arrow.set_linewidth(1) # set arrow linewidth to 1
# displaying the plot
plt.show()
# Print a text report showing the rules of a decision tree
print(
tree.export_text(
model, # specify the model
feature_names=feature_names, # specify the feature names
show_weights=True # specify whether or not to show the weights associated with the model
)
)
# Determine the importance of features in the tree building
# (The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
# It is also known as the Gini importance )
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
# Plot the chart to show the importance of features
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
We can see the order of importance of the columns: Income, Family, Education, CCAvg, CD_Account, Age, Experience, then the rest.
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
data_copy.columns
# Use the same starting-point data (copy again so data_copy remains intact)
data = data_copy.copy()
# Drop Experience as it is correlated with Age
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
X = X.astype(float)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape, '\n')
print("Percentage of classes in training set:")
print(100*y_train.value_counts(normalize=True), '\n')
print("Percentage of classes in test set:")
print(100*y_test.value_counts(normalize=True))
model1 = DecisionTreeClassifier(criterion="gini", random_state=1)
model1.fit(X_train, y_train)
confusion_matrix_sklearn(model1, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model1, X_train, y_train
)
decision_tree_perf_train
feature_names = list(X_train.columns)
print(feature_names)
# list of feature names in X_train
feature_names = list(X_train.columns)
# set the figure size for the plot
plt.figure(figsize=(20, 30))
# plotting the decision tree
out = tree.plot_tree(
model1, # decision tree classifier model
feature_names=feature_names, # list of feature names (columns) in the dataset
filled=True, # fill the nodes with colors based on class
fontsize=9, # font size for the node text
node_ids=False, # do not show the ID of each node
class_names=None, # whether or not to display class names
)
# add arrows to the decision tree splits if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black") # set arrow color to black
arrow.set_linewidth(1) # set arrow linewidth to 1
# displaying the plot
plt.show()
print(tree.export_text(model1, feature_names=feature_names, show_weights=True))
# Determine the importance of features in the tree building
# (The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
# It is also known as the Gini importance )
print(
pd.DataFrame(
model1.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
# Plot the chart to show the importance of features
importances = model1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
confusion_matrix_sklearn(model1, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(model1, X_test, y_test)
decision_tree_perf_test
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 7, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]
'''
# I tried these parameters also and
# max_depth_values=2, max_leaf_nodes_values=10, min_samples_split_values=10, produced a Recall of 1
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = np.arange(10, 51, 10)
min_samples_split_values = np.arange(10, 51, 10)
'''
# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
for max_leaf_nodes in max_leaf_nodes_values:
for min_samples_split in min_samples_split_values:
# Initialize the tree with the current set of parameters
estimator = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
class_weight='balanced',
random_state=42
)
# Fit the model to the training data
estimator.fit(X_train, y_train)
# Make predictions on the training and test sets
y_train_pred = estimator.predict(X_train)
y_test_pred = estimator.predict(X_test)
# Calculate recall scores for training and test sets
train_recall_score = recall_score(y_train, y_train_pred)
test_recall_score = recall_score(y_test, y_test_pred)
# Calculate the absolute difference between training and test recall scores
score_diff = abs(train_recall_score - test_recall_score)
# Update the best estimator when the current one has both a smaller train/test
# score difference and a higher test recall
if (score_diff < best_score_diff) and (test_recall_score > best_test_score):
best_score_diff = score_diff
best_test_score = test_recall_score
best_estimator = estimator
# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
# Fit the best algorithm to the data.
estimator = best_estimator
estimator.fit(X_train, y_train)
confusion_matrix_sklearn(estimator, X_train, y_train)
dtree_pre_pruning_train_perf = model_performance_classification_sklearn(
estimator, X_train, y_train
)
dtree_pre_pruning_train_perf
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Print text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Check Performance of Test Data
confusion_matrix_sklearn(estimator, X_test, y_test)
dtree_pre_pruning_test_perf = model_performance_classification_sklearn(
estimator, X_test, y_test
)
dtree_pre_pruning_test_perf
Observations For Pre-Pruning
Interestingly, with these model parameters (max depth: 2, max leaf nodes: 50, min samples split: 10), the Recall for both the training and test data is 1. That is, there are no False Negatives. It is also noteworthy that the Accuracy, Precision, and F1 scores are not too far off.
I also tried some other parameters and found at least one other model that gave a Recall of 1. Those parameters are: max depth: 2, max leaf nodes: 10, min samples split: 10.
I am not entirely comfortable with the fact that this model gives a Recall score of 1, which looks perfect for our purposes. It is too perfect! The fact that the Accuracy, Precision, and F1 scores are quite low is also suspect. Lastly, our instructor said that the difference between Recall and Precision should be around 20%.
I will try other parameters, so bear with me; I will move through this one faster.
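The manual triple loop above can also be written with `GridSearchCV` (imported but left commented out near the top of the notebook), scoring on recall with 5-fold cross-validation on the training data alone. The data and parameter grid below are synthetic stand-ins, not the project's:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (~10% positive class)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Illustrative grid; the project's candidate values could be substituted here
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [10, 30, 50],
}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",  # optimize recall, as in the manual search
    cv=5,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

One advantage over the manual loop is that model selection uses cross-validated recall on the training folds only, so the test set is never consulted during tuning.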
# Define the parameters of the tree to iterate over
class_weight = [None, "balanced"]
max_depth_values = [3, 5, 7, 9, 10]
max_leaf_nodes_values = [50, 100, 200, 500, 1000]
min_samples_split_values = [2, 10, 50, 100]
# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
for max_leaf_nodes in max_leaf_nodes_values:
for min_samples_split in min_samples_split_values:
# Initialize the tree with the current set of parameters
estimatorZ = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
class_weight='balanced',
random_state=42
)
# Fit the model to the training data
estimatorZ.fit(X_train, y_train)
# Make predictions on the training and test sets
y_train_pred = estimatorZ.predict(X_train)
y_test_pred = estimatorZ.predict(X_test)
# Calculate recall scores for training and test sets
train_recall_score = recall_score(y_train, y_train_pred)
test_recall_score = recall_score(y_test, y_test_pred)
# Calculate the absolute difference between training and test recall scores
score_diff = abs(train_recall_score - test_recall_score)
# Update the best estimator when the current one has both a smaller train/test
# score difference and a higher test recall
if (score_diff < best_score_diff) and (test_recall_score > best_test_score):
best_score_diff = score_diff
best_test_score = test_recall_score
best_estimator = estimatorZ
# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
# Fit the best algorithm to the data.
estimator1 = best_estimator
estimator1.fit(X_train, y_train)
confusion_matrix_sklearn(estimator1, X_train, y_train)
dtree_pre_pruning_train_perf1 = model_performance_classification_sklearn(
estimator1, X_train, y_train
)
dtree_pre_pruning_train_perf1
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator1,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Print text report showing the rules of a decision tree -
print(tree.export_text(estimator1, feature_names=feature_names, show_weights=True))
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
estimator1.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
importances = estimator1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
confusion_matrix_sklearn(estimator1, X_test, y_test)
dtree_pre_pruning_test_perf1 = model_performance_classification_sklearn(
estimator1, X_test, y_test
)
dtree_pre_pruning_test_perf1
# Create an instance of the decision tree model
clf = DecisionTreeClassifier(random_state=1)
# Compute the cost complexity pruning path for the model using the training data
path = clf.cost_complexity_pruning_path(X_train, y_train)
# Extract the array of effective alphas from the pruning path
# (abs() guards against tiny negative alphas caused by floating-point error)
ccp_alphas = abs(path.ccp_alphas)
# Extract the array of total impurities at each alpha along the pruning path
impurities = path.impurities
pd.DataFrame(path)
# Create a figure
fig, ax = plt.subplots(figsize=(10, 5))
# Plot the total impurities versus effective alphas, excluding the last value,
# using markers at each data point and connecting them with steps
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
# Set the x-axis label
ax.set_xlabel("Effective Alpha")
# Set the y-axis label
ax.set_ylabel("Total impurity of leaves")
# Set the title of the plot
ax.set_title("Total Impurity vs Effective Alpha for training set");
Next, we train a decision tree using the effective alphas.
The last value in ccp_alphas is the alpha value that prunes the whole tree,
leaving the corresponding tree with one node.
# Initialize an empty list to store the decision tree classifiers
clfs = []
# Iterate over each ccp_alpha value extracted from cost complexity pruning path
for ccp_alpha in ccp_alphas:
# Create an instance of the DecisionTreeClassifier
clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=1)
# Fit the classifier to the training data
clf.fit(X_train, y_train)
# Append the trained classifier to the list
clfs.append(clf)
# Print the number of nodes in the last tree along with its ccp_alpha value
print(
"Number of nodes in the last tree is {} with ccp_alpha {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
# Remove the last classifier and corresponding ccp_alpha value from the lists
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
# Extract the number of nodes in each tree classifier
node_counts = [clf.tree_.node_count for clf in clfs]
# Extract the maximum depth of each tree classifier
depth = [clf.tree_.max_depth for clf in clfs]
# Create a figure and a set of subplots
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
# Plot the number of nodes versus ccp_alphas on the first subplot
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("Alpha")
ax[0].set_ylabel("Number of nodes")
ax[0].set_title("Number of nodes vs Alpha")
# Plot the depth of tree versus ccp_alphas on the second subplot
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("Alpha")
ax[1].set_ylabel("Depth of tree")
ax[1].set_title("Depth vs Alpha")
# Adjust the layout of the subplots to avoid overlap
fig.tight_layout()
train_recall = [] # Initialize an empty list to store Recall scores for training set for each decision tree classifier
# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
# Predict labels for the training set using the current decision tree classifier
pred_train = clf.predict(X_train)
# Calculate the recall score for the training set predictions compared to true labels
train_recall_score = recall_score(y_train, pred_train)
# Append the calculated recall score to the train_recall list
train_recall.append(train_recall_score)
test_recall = [] # Initialize an empty list to store Recall scores for test set for each decision tree classifier
# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
# Predict labels for the test set using the current decision tree classifier
pred_test = clf.predict(X_test)
# Calculate the recall score for the test set predictions compared to true labels
recall_test_score = recall_score(y_test, pred_test)
# Append the calculated recall score to the test_recall_scores list
test_recall.append(recall_test_score)
train_scores = [clf.score(X_train, y_train) for clf in clfs] # Calculate training scores for each decision tree classifier
test_scores = [clf.score(X_test, y_test) for clf in clfs] # Calculate testing scores for each decision tree classifier
test_recall
train_recall
ccp_alphas
# Combine the test_recall, train_recall, and ccp_alphas arrays into a DataFrame
pd.DataFrame({'test_recall': test_recall, 'train_recall': train_recall, 'ccp_alphas': ccp_alphas})
# Create a figure
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("Alpha") # Set the label for the x-axis
ax.set_ylabel("Recall Score") # Set the label for the y-axis
ax.set_title("Recall Score vs Alpha for training and test sets") # Set the title of the plot
# Plot the training Recall scores against alpha, using circles as markers and steps-post style
ax.plot(ccp_alphas, train_recall, marker="o", label="training", drawstyle="steps-post")
# Plot the testing Recall scores against alpha, using circles as markers and steps-post style
ax.plot(ccp_alphas, test_recall, marker="o", label="test", drawstyle="steps-post")
ax.legend(); # Add a legend to the plot
# creating the model where we get highest test Recall Score
index_best_model = np.argmax(test_recall)
# selecting the decision tree model corresponding to the highest test score
dtree_post_pruning = clfs[index_best_model]
print(dtree_post_pruning)
Check Performance on Training Data
confusion_matrix_sklearn(dtree_post_pruning, X_train, y_train)
dtree_post_pruning_train_perf = model_performance_classification_sklearn(
dtree_post_pruning, X_train, y_train
)
dtree_post_pruning_train_perf
Visualize the Decision Tree
plt.figure(figsize=(10, 13))
out = tree.plot_tree(
dtree_post_pruning,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# importance of features in the tree building
#( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
dtree_post_pruning.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
importances = dtree_post_pruning.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Check performance on test data
confusion_matrix_sklearn(dtree_post_pruning, X_test, y_test)
dtree_post_pruning_test_perf = model_performance_classification_sklearn(
dtree_post_pruning, X_test, y_test
)
dtree_post_pruning_test_perf
# Create a model with the chosen ccp_alpha value
# and balance the classes using weights close to the inverse of the class frequencies
dtree_post_pruning1 = DecisionTreeClassifier(
ccp_alpha=0.001647, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
dtree_post_pruning1.fit(X_train, y_train)
confusion_matrix_sklearn(dtree_post_pruning1, X_train, y_train)
decision_tree_tune_post_train1 = model_performance_classification_sklearn(dtree_post_pruning1, X_train, y_train)
decision_tree_tune_post_train1
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
dtree_post_pruning1,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(dtree_post_pruning1, feature_names=feature_names, show_weights=True))
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
dtree_post_pruning1.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
importances = dtree_post_pruning1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
confusion_matrix_sklearn(dtree_post_pruning1, X_test, y_test)
decision_tree_tune_post_test1 = model_performance_classification_sklearn(dtree_post_pruning1, X_test, y_test)
decision_tree_tune_post_test1
# training performance comparison
models_train_comp_df = pd.concat(
[decision_tree_perf_train.T, dtree_pre_pruning_train_perf.T, dtree_pre_pruning_train_perf1.T, dtree_post_pruning_train_perf.T, decision_tree_tune_post_train1.T], axis=1,
)
models_train_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning: 2,50,10)", "Decision Tree (Pre-Pruning 1: 5,50,100)", "Decision Tree (Post-Pruning: Recall 1)", "Decision Tree (Post-Pruning: Recall < 1, ccp_alpha = 0.001647)"]
print("Training performance comparison:")
models_train_comp_df
# testing performance comparison
models_test_comp_df = pd.concat(
[decision_tree_perf_test.T, dtree_pre_pruning_test_perf.T, dtree_pre_pruning_test_perf1.T, dtree_post_pruning_test_perf.T, decision_tree_tune_post_test1.T], axis=1,
)
models_test_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning: 2,50,10)", "Decision Tree (Pre-Pruning 1: 5,50,100)", "Decision Tree (Post-Pruning: Recall 1)", "Decision Tree (Post-Pruning: Recall < 1, ccp_alpha = 0.001647)"]
print("Testing performance comparison:")
models_test_comp_df
Note: This model does not appear production-ready, even with a Recall of 1 on both the training and test data, because of the low values of the other scores (per our MLS instructor's rule of thumb, Recall and Precision should be within about 20% of each other).
The model with the next highest Recall on the train and test data is the second pre-pruned model (max_depth=5, max_leaf_nodes=50, min_samples_split=100). It has much better Accuracy, Precision, and F1 scores than the first pre-pruned model, although the latter two scores are still somewhat sub-standard in my opinion. If those numbers are acceptable to the client, this model could work.
The default model and the first post-pruned model produce exactly the same performance scores. This is likely because the alpha selected for the highest test Recall is the smallest one (effectively zero), so the post-pruning left the tree unchanged. The Recall of this model is not as high as that of the first two models discussed, but it is high enough, and the other scores are very high, so it would not be a bad model to use since it generalizes well. It does, however, produce the most complex tree; that might be a point to consider.
Income is the most important feature affecting whether a customer will take a personal loan, and this holds across all models. The other notable features are CCAvg and Family with the chosen model, and Education with the other models.
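The pre-pruning parameter combinations compared above can also be searched systematically rather than tried one at a time. A minimal sketch (on synthetic data, with a hypothetical grid mirroring the parameters discussed) using GridSearchCV with Recall as the selection metric:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (like Personal_Loan)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Hypothetical grid echoing the pre-pruning values compared above
param_grid = {
    "max_depth": [2, 5],
    "max_leaf_nodes": [50],
    "min_samples_split": [10, 100],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",  # select for Recall, consistent with the analysis above
    cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.best_score_)
```

The best estimator is then available as `grid.best_estimator_` and can be evaluated on the test set like any of the models above.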
Predicting on a single data point
%%time
# Choosing a data point
#applicant_details = X_test.iloc[:1, :] # Does not Have Personal Loan
applicant_details = X_test.iloc[8:9, :] # Has Personal Loan
#print(applicant_details)
# making a decision
approval_prediction = estimator.predict(applicant_details)
print(approval_prediction)
%%time
# Predict the likelihood
approval_likelihood = estimator.predict_proba(applicant_details)
#print(approval_likelihood)
print(approval_likelihood[0, 1])
Further Observation
I located an actual data point with a 1 in the Personal_Loan column and ran predictions on it with all the models I created. Only the model we decided to go with correctly predicted that this customer would take a personal loan, with a high likelihood of about 96%; all the other models predicted otherwise. This validates the model we chose for this purpose, regardless of the values of the other scores. It is also a major efficiency gain: a process that may take hours to accomplish manually completes in milliseconds.
From the decision tree of the chosen model, we observe that if income is greater than 92.5K and the family size is greater than 2.5, the customer is most likely to get a personal loan.
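The rule read off the tree above can be expressed directly as a filter on the data. A small sketch, using hypothetical sample rows but the dataset's actual column names and the split thresholds from the chosen tree:

```python
import pandas as pd

# Hypothetical sample rows using the dataset's column names
df = pd.DataFrame({
    "Income": [120, 60, 95],
    "Family": [3, 4, 2],
})
# Split thresholds read off the chosen tree: Income > 92.5 and Family > 2.5
likely_loan = (df["Income"] > 92.5) & (df["Family"] > 2.5)
print(df[likely_loan])  # only the first row satisfies both conditions
```

A filter like this could give the marketing team a quick first cut of the segment to target, without re-running the full model.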
The bank can deploy the model to handle the obvious cases of which customers will or will not get a personal loan, leaving the non-obvious cases to be reviewed manually by the experts. In this way, it can serve as an automated initial screening.
The bank can use the likelihood output of the model as a confidence factor: based on a predetermined probability threshold, decide whether a customer should get the personal loan automatically or whether further scrutiny is needed.
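The threshold-based routing described above can be sketched as a simple rule on top of predict_proba. The 0.9 and 0.1 cut-offs below are hypothetical placeholders for thresholds the bank would choose, and the model is fit on synthetic data for self-containment:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=1)
model = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X, y)

def screen_applicant(model, row, approve_at=0.9, reject_at=0.1):
    """Route one applicant by the model's predicted probability of taking a loan.

    Probabilities at or above approve_at are auto-approved, at or below
    reject_at are auto-rejected, and everything in between goes to the
    experts for manual review.
    """
    p = model.predict_proba(row.reshape(1, -1))[0, 1]
    if p >= approve_at:
        return "auto-approve", p
    if p <= reject_at:
        return "auto-reject", p
    return "manual-review", p

decision, p = screen_applicant(model, X[0])
print(decision, round(p, 3))
```

In practice the thresholds would be set from the business's tolerance for false approvals versus manual-review workload.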
Using the model in this way will reduce the bank's workload and shorten the turnaround time both for the initial screening and for the whole process of determining which customers will or will not get a personal loan.